Application of BIRCH to text clustering

Authors

  • Ilya Karpov
  • Alexandr Goroslavskiy
Abstract

This work presents a clustering technique based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm and LSA methods for clustering large, high-dimensional datasets. We present a document model and a clustering tool for processing texts in Russian and English, and compare our results with other clustering techniques. Experimental results for clustering datasets of 10,000, 100,000 and 850,000 documents are provided.
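
As a rough illustration of the idea behind BIRCH, the sketch below shows its clustering feature (CF), a triple of point count, linear sum and squared sum that can be merged by simple addition, so a cluster's centroid and radius are available without storing its points. This is not the authors' tool: the names `ClusteringFeature` and `flat_birch_pass` are hypothetical, the flat list of leaves stands in for the real CF-tree, and the input points are assumed to be document vectors that have already been reduced (for example by LSA).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """BIRCH clustering feature: (N, linear sum, squared sum)."""
    n: int
    ls: List[float]
    ss: float

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging two clusters just adds the components.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance of the members from the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))


def flat_birch_pass(points, threshold: float) -> List[ClusteringFeature]:
    """Simplified single pass: absorb each point into the nearest leaf CF
    if the merged radius stays within `threshold`, else start a new leaf.
    Assumption: a flat list replaces the balanced CF-tree of real BIRCH."""
    leaves: List[ClusteringFeature] = []
    for x in points:
        cf_x = ClusteringFeature.from_point(x)
        if leaves:
            best = min(
                range(len(leaves)),
                key=lambda i: math.dist(leaves[i].centroid(), x),
            )
            candidate = leaves[best].merge(cf_x)
            if candidate.radius() <= threshold:
                leaves[best] = candidate
                continue
        leaves.append(cf_x)
    return leaves


if __name__ == "__main__":
    docs = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
    for cf in flat_birch_pass(docs, threshold=0.5):
        print(cf.n, cf.centroid(), round(cf.radius(), 3))
```

Because each leaf keeps only three quantities, memory use depends on the number of leaves rather than the number of documents, which is what makes this family of methods attractive for large collections.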

Similar resources

Application of modified balanced iterative reducing and clustering using hierarchies algorithm in parceling of brain performance using fMRI data

Introduction: Clustering of the human brain is a very useful tool for the diagnosis, treatment, and tracking of brain tumors, and several methods exist for this purpose. In this study, modified balanced iterative reducing and clustering using hierarchies (m-BIRCH) was introduced for brain activation clustering. This algorithm has an appropriate speed and good scalability in dealing ...

Advanced Split BIRCH Algorithm in Reconfigurable Network

The Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm has the disadvantage of reduced accuracy on clusters of arbitrary shape. To address this, a split-improved BIRCH algorithm (AS-Birch) was put forward. Through the analysis of the reconfigurable network and a detailed analysis of application scenarios and functional requirements of business clusterin...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Document clustering based on ontology and a fuzzy approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

BIRCH: An Efficient Data Clustering Method for Very Large Databases

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH...

Publication date: 2012